research-article

Constructing Novel Block Layouts for Webpage Analysis

Published: 10 July 2019

Abstract

Webpage segmentation is a basic building block for a wide range of webpage analysis methods. The rapid development of Web technologies has produced more dynamic and complex webpages, which bring new challenges to this area. To improve the performance of webpage segmentation, we propose a two-stage segmentation method that combines visual, logical, and semantic features of the content on a webpage. Specifically, in the first stage we devise a new model that measures the similarity of elements on a webpage based on both visual layout and logical organization; in the second stage we propose a novel block regrouping method that uses semantic statistics and visual positions. This two-stage method can effectively segment complicated and dynamic webpages. We verify its performance and accuracy by comparing it with two existing webpage segmentation methods. The experimental results show that the proposed method significantly outperforms the existing state of the art in terms of precision, recall, and accuracy.
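
The abstract does not give the similarity model or the regrouping rule, so the following is only a minimal sketch of what a two-stage pipeline of this shape might look like, not the authors' method. Every name, feature, weight, and threshold here (`Element`, `visual_similarity`, `logic_similarity`, `first_stage`, `second_stage`, center distance, DOM-path prefix, word overlap) is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Element:
    """A rendered page element; all fields are illustrative assumptions."""
    dom_path: tuple  # tag names from the root, e.g. ("html", "body", "p")
    box: tuple       # (left, top, width, height) in pixels
    text: str = ""


def visual_similarity(a, b, scale=200.0):
    """Closeness of box centers mapped to (0, 1]; `scale` is an arbitrary constant."""
    ax, ay = a.box[0] + a.box[2] / 2, a.box[1] + a.box[3] / 2
    bx, by = b.box[0] + b.box[2] / 2, b.box[1] + b.box[3] / 2
    distance = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
    return 1.0 / (1.0 + distance / scale)


def logic_similarity(a, b):
    """Length of the shared DOM-path prefix divided by the longer path length."""
    shared = 0
    for x, y in zip(a.dom_path, b.dom_path):
        if x != y:
            break
        shared += 1
    return shared / max(len(a.dom_path), len(b.dom_path))


def first_stage(elements, weight=0.5, threshold=0.6):
    """Stage 1 (assumed): union elements whose weighted similarity reaches `threshold`."""
    parent = list(range(len(elements)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(elements)), 2):
        score = (weight * visual_similarity(elements[i], elements[j])
                 + (1 - weight) * logic_similarity(elements[i], elements[j]))
        if score >= threshold:
            parent[find(i)] = find(j)

    blocks = {}
    for i, element in enumerate(elements):
        blocks.setdefault(find(i), []).append(element)
    return list(blocks.values())


def words(block):
    """Lower-cased vocabulary of a block, a stand-in for 'semantic statistics'."""
    return {w.lower() for e in block for w in e.text.split()}


def vertical_gap(a, b):
    """Vertical distance in pixels between two blocks (0 if they overlap)."""
    a_top, a_bottom = min(e.box[1] for e in a), max(e.box[1] + e.box[3] for e in a)
    b_top, b_bottom = min(e.box[1] for e in b), max(e.box[1] + e.box[3] for e in b)
    return max(0, max(a_top, b_top) - min(a_bottom, b_bottom))


def second_stage(blocks, min_overlap=0.3, max_gap=50):
    """Stage 2 (assumed): merge vertically adjacent blocks with similar vocabulary."""
    merged = []
    for block in sorted(blocks, key=lambda b: min(e.box[1] for e in b)):
        if merged:
            prev = merged[-1]
            wa, wb = words(prev), words(block)
            union = wa | wb
            jaccard = len(wa & wb) / len(union) if union else 0.0
            if jaccard >= min_overlap and vertical_gap(prev, block) <= max_gap:
                prev.extend(block)
                continue
        merged.append(list(block))
    return merged
```

In this sketch the first stage groups elements that sit close together and share DOM ancestry, and the second stage can still attach a block that the first stage kept separate (for example, a sidebar note with a different DOM path) when it lies directly below another block and uses similar words.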


      • Published in

        ACM Transactions on Internet Technology, Volume 19, Issue 3
        Special Section on Advances in Internet-Based Collaborative Technologies
        August 2019
        289 pages
        ISSN: 1533-5399
        EISSN: 1557-6051
        DOI: 10.1145/3329912
        • Editor: Ling Liu

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 10 July 2019
        • Accepted: 1 April 2019
        • Revised: 1 March 2019
        • Received: 1 April 2018
        Published in TOIT, Volume 19, Issue 3


        Qualifiers

        • research-article
        • Research
        • Refereed
